Abstract

We present an approach to email filtering based on the suffix tree data 
structure. A method for the scoring of emails using the suffix tree is 
developed and a number of scoring and score normalisation functions are 
tested. Our results show that the character level representation of emails and 
classes facilitated by the suffix tree can significantly improve classification 
accuracy when compared with the currently popular methods, such as naive 
Bayes. We believe the method can be extended to the classification of
documents in other domains. 


1 Introduction 

Just as email traffic has increased over the years since its inception, so has the 
proportion that is unsolicited; some estimates have placed the proportion
as high as 60%, and the average cost of this to business at around $2000 per 
year, per employee (see 1^ for a range of numbers and statistics on spam). 
Unsolicited emails, commonly known as spam, have thereby become a daily
feature of every email user’s inbox; and regardless of advances in email
filtering, spam continues to be a problem in a similar way to computer viruses
which constantly reemerge in new guises. This leaves the research
community with the task of continually investigating new approaches to sorting the
welcome emails (known as ham) from the unwelcome spam.

We present just such an approach to email classification and filtering
based on a well studied data structure, the suffix tree (see for a brief 
introduction). The approach is similar to many existing ones, in that it uses 
training examples to construct a model or profile of the class and its features,
then uses this to make decisions as to the class of new examples; but it differs
in the depth and extent of the analysis. For a good overview of a number of
text classification methods, see I26i m i3n . 

Using a suffix tree, we are able to compare not only single words, as in 
most current approaches, but substrings of an arbitrary length. Comparison
of substrings (at the level of characters) has particular benefits in the domain
of spam classification because of the methods spammers use to evade filters. 
For example, they may disguise the nature of their messages by interpolating 
them with meaningless characters, thereby fooling filters based on keyword 
features into considering the words, sprinkled with random characters, as 
completely new and unencountered. If we instead treat the words as character 
strings, and not features in themselves, we are still able to recognise the 
substrings, even if the words are broken. 

Section 2 gives examples of some of the methods spammers use to evade
detection which make it useful to consider character level features. Section 3
gives a brief explanation of the naive Bayes method of text classification as an
example of a conventional approach. Section 4 briefly introduces suffix trees,
with some definitions and notations which are useful in the rest of the paper,
before going on to explain how the suffix tree is used to classify text and filter
spam. Section 5 describes our experiments, the test parameters and details of
the data sets we used. Section 6 presents the results of the experiments and
provides a comparison with results in the literature. Section 7 concludes.

2 Examples of Spam 

Spam messages typically advertise a variety of products or services ranging 
from prescription drugs or cosmetic surgery to sun glasses or holidays. But 
regardless of what is being advertised, one can distinguish between the
methods used by the spammer to evade detection. These methods have evolved
with the filters which attempt to extirpate them, so there is a generational
aspect to them, with later generations becoming gradually more common and
earlier ones fading out; as this happens, earlier generations of filters become 
less effective. 

We present four examples of spam messages, the first of which illustrates 
undisguised spam while the other three illustrate one or more methods of 
evasion. 

1. Undisguised message. The example contains no obfuscation. The 
content of the message is easily identified by filters, and words like 
“Viagra” allow it to be recognised as spam. Such messages are very 
likely to be caught by the simplest word-based Bayesian classifiers. 
Buy cheap medications online, no prescription needed.

We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol 
and many more products. 

No embarrasing trips to the doctor, get it delivered directly to your 
door. 

Experienced reliable service. 

Most trusted name brands. 

Your solution is here; http://www.webrx-doctor.com/?rid=1000 


2. Intra-word characters. 


Get the low.est pri.ce for gen.eric medica.tions! 

Xa.n.ax - only $100 
Vi.cod.in - only $99 
Ci.al.is - only $2 per do.se 
Le.vit.ra - only $73 
Li.pit.or - only $99 
Pr.opec.ia - only $79 
Vi.agr.a - only $4 per do.se 
Zo.co.r - only $99 

Your Sav.ings 40% compared Average Internet Pr.ice! 

No Consult.ation Fe.es! No Prior Prescrip.tions Required! No 
Appoi.ntments! 

No Wait.ing Room! No Embarra.ssment! Private and Con- 
fid.ential! Disc.reet Packa.ging! 

che ck no w; 

http;//priorlearndiplomas.com/r3/?d=getanon 


The example above shows the use of intra-word characters, which may 
be non-alpha-numeric or whitespace. Here the word “Viagra” has
become “Vi.agr.a”, while the word “medications” has become
“medica.tions”. To a simple word-based Bayesian classifier, these are
completely new words, which might have occurred rarely, or not at all, in
previous examples. Obviously, there are a large number of variations
on this theme which would each time create an effectively new word
which would not be recognised as spam content. However, if we
approach this email at the character level, we can still recognise strings
such as “medica” as indicative of spam, regardless of the character
that follows, and furthermore, though we do not deal with this in the
current paper, we might implement a look-ahead window which
attempts to skip (for example) non-alphabetic characters when searching
for spammy features.

Certainly, one way of countering such techniques of evasion is to map 
the obfuscated words to genuine words during a pre-processing stage, 
and doing this will help not only word-level filters, but also
character-level filters because an entire word match, either as a single unit, or a
string of characters, is better than a partial word match. 

However, some other methods of evasion may not be countered so easily in
this way, with each requiring its own special treatment; we give two more
examples below which illustrate the point.

3. Word salad. 


Buy meds online and get it shipped to your door Find out more 
here 

http;//www.gowebrx.com/?rid= 1001 

a publications website accepted definition, known are can 
Commons the be dehnition. Commons UK great public principal 
work Pre-Budget but an can Majesty’s many contains statements 
statements titles (eg includes have website, health, these Com¬ 
mittee Select undertaken described may publications 


The example shows the use of what is sometimes called a word salad,
meaning a random selection of words. The first two lines of the
message are its real content; the paragraph below them consists of words
taken randomly from what might have been a government budget
report. The idea is that these words are likely to occur in ham, and would
lead a traditional algorithm to classify this email as such. Again,
approaching this at the character level can help. For example, if we
consider strings of up to length 8, strings such as “are can” and “an can”
are unlikely to occur in ham, even though the words “an”, “are” and “can” may
occur quite frequently. Of course, in most ’bag-of-words’
implementations, words such as these are pruned from the feature set, but the
argument still holds for other bigrams.

4. Embedded message (also contains a word/letter salad). The
example below shows an embedded message. Inspection of it will reveal that
it is actually offering prescription drugs. However, there are no
easily recognised words, except those that form the word salad, this time
taken from what appear to be dictionary entries under ’z’. The value 
of substring searching is highly apparent in this case as it allows us to 
recognise words such as “approved”, “Viagra” and “Tablets”, which 
would otherwise be lost among the characters pressed up against them. 


zygotes zoogenous zoometric zygosphene zygotactic zygoid 
zucchettos zymolysis zoopathy zygophyllaceous zoophytologist 
zygomaticoauricular zoogeologist zymoid zoophytish zoospores 
zygomaticotemporal zoogonous zygotenes zoogony zymosis 
zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic 
zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic 
zoolatrous zoophilous zymotically zymosterol 

FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJT 

* GetJIIXQLDViagraPWXJXFDUUTabletsNXZXVRCBX 
http://healthygrow.biz/index.php?id=2 

zonally zooidal zoospermia zoning zoonosology zooplankton 
zoochemical zoogloeal zoological zoologist zooid zoosphere 
zoochemical 

& Safezoonal andNGASXHBPnatural 
& TestedQLOLNYQandEAVMGFCapproved 

zonelike zoophytes zoroastrians zonular zoogloeic zoris 
zygophore zoograft zoophiles zonulas zygotic zymograms 
zygotene zootomical zymes zoodendrium zygomata zoometries 
zoographist zygophoric zoosporangium zygotes zumatic zygo- 
maticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia 
zygodactylism zygotenes zoopathological noZFYFEPBmas 
http; //healthy grow, biz/remove. php 


These examples are only a sample of all the types of spam that exist; for
an excellent and often updated list of examples and categories, see cniiii. 
Under the categories suggested in Ea, examples 2 and 4 would count as
’Tokenisation’ and/or ’Obfuscation’, while examples 2 and 3 would count as 
’Statistical’. 

We look next at a bag-of-words approach, naive Bayes, before
considering the suffix tree approach.


3 Naive Bayesian Classification 

Naive Bayesian (NB) email filters currently attract a lot of research and
commercial interest, and have proved highly successful at the task; and Eli
are both excellent studies of this approach to email filtering. We do not give 
detailed attention to NB as it is not the intended focus of this paper; for a 
general discussion of NB see ini, for more context in text categorisation 
see [26], and for an extension of NB to the classification of structured data,
see (8). However, an NB classifier is useful in our investigation of the suffix 
tree classifier, and in particular, our own implementation of NB is necessary 
to investigate experimental conditions which have not been explored in the 
literature. We therefore briefly present it here. 

We begin with a set of training examples with each example document 
assigned to one of a fixed set of possible classes, C = {c_1, c_2, c_3, ..., c_J}. An
NB classifier uses this training data to generate a probabilistic model of each 
class; and then, given a new document to classify, it uses the class models and 
Bayes’ rule to estimate the likelihood with which each class generated the 
new document. The document is then assigned to the most likely class. The 
features, or parameters, of the model are individual words; and it is ’naive’ 
because of the simplifying assumption that, given a class, each parameter is 
independent of the others. 

(H distinguish between two types of probabilistic models which are 
commonly used in NB classifiers: the multi-variate Bernoulli event model 
and the multinomial event model. We adopt the latter, under which a
document is seen as a series of word events and the probability of the event given
a class is estimated from the frequency of that word in the training data of 
the class. 

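Hence, given a document d = {d_1 d_2 d_3 ... d_L}, we use Bayes' theorem to
estimate the probability of a class, c_j:

$$ P(c_j \mid d) = \frac{P(c_j)\, P(d \mid c_j)}{P(d)} \tag{1} $$

Assuming that words d_i are independent given the category, this leads to:

$$ P(c_j \mid d) = \frac{P(c_j) \prod_{i=1}^{L} P(d_i \mid c_j)}{P(d)} \tag{2} $$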

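We estimate P(c_j) from the N_j training documents of class c_j among the N
training documents overall:

$$ P(C = c_j) = \frac{N_j}{N} \tag{3} $$

and P(d_i | c_j) as:

$$ P(d_i \mid c_j) = \frac{1 + N_{ij}}{M + \sum_{k=1}^{M} N_{kj}} \tag{4} $$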
Figure 1: A Suffix Tree after the insertion of “meet”.


where N_ij is the number of times word i occurs in class j (similarly for N_kj)
and M is the total number of words considered.

To classify a document we calculate two scores, for spam and ham, and
take the ratio, hsr = hamScore / spamScore; we classify the document as ham if it is above
a threshold, th, and as spam if it is below (see Section 5.1.3).
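
The classifier described above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the experiments: it assumes documents are supplied as lists of tokens, an add-one smoothed multinomial estimate of the word probabilities, the class names "ham" and "spam", and it works with logarithms so that the ratio hsr becomes a difference of log scores (P(d) cancels in the ratio). All function names are hypothetical.

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class maps a class name to a list of token lists."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)      # word counts in class c
        model[c] = {
            "log_prior": math.log(len(docs) / total_docs),
            "counts": counts,
            "total": sum(counts.values()),                 # all word events in class c
        }
    return model, vocab

def log_score(model, vocab, c, tokens):
    """Log of the class prior times the product of smoothed word probabilities."""
    m = model[c]
    score = m["log_prior"]
    for w in tokens:
        if w in vocab:  # assumption: words never seen in training are skipped
            score += math.log((1 + m["counts"][w]) / (len(vocab) + m["total"]))
    return score

def classify(model, vocab, tokens, th=1.0):
    """Ham if the ham/spam ratio exceeds th, otherwise spam."""
    log_hsr = (log_score(model, vocab, "ham", tokens)
               - log_score(model, vocab, "spam", tokens))
    return "ham" if log_hsr > math.log(th) else "spam"

model, vocab = train({
    "ham": [["meeting", "at", "noon"], ["lunch", "at", "noon"]],
    "spam": [["cheap", "viagra", "online"], ["buy", "viagra", "cheap"]],
})
```

Raising `th` above 1 makes the sketch more reluctant to label a message as ham; lowering it has the opposite effect.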

4 Suffix Tree Classification 

4.1 Introduction 

The suffix tree is a data storage and fast search technique which has been 
used in fields such as computational biology for applications such as string 
matching applied to DNA sequences |4l 1171 . To our knowledge it has not 
been used in the domain of natural language text classification. 

We adopted a conventional procedure for using a suffix tree in text
classification. As with NB, we take a set of documents D which are each known
to belong to one class, Cj, in a set of classes, C, and build one tree for each 
class. Each of these trees is then said to represent (or profile) a class (a tree 
built from a class will be referred to as a “class tree”). 

Given a new document d, we score it with respect to each of the class 
trees and the class of the highest scoring tree is taken as the class of the 
document. 

We address the scoring of documents in Section 4.4, but first we consider
the construction of the class tree.
Figure 2: A Suffix Tree after insertion of strings “meet” and “feet”.


4.2 Suffix Tree Construction 

We provide a brief introduction to suffix tree construction. For a more
detailed treatment, along with algorithms to improve computational efficiency,
the reader is directed to CD Our representation of a suffix tree differs from 
the literature in two ways that are specific to our task; first, we label nodes 
and not edges, and second, we do not use a special terminal character. The 
former has little impact on the theory and allows us to associate frequencies 
directly with characters and substrings. The latter is simply because our
interest is actually focused on substrings rather than suffixes; the inclusion of
a terminal character would therefore not aid our algorithms, and its absence 
does not hinder them. Furthermore, our trees are depth limited, and so the 
inclusion of a terminal character would be meaningless in most situations. 

Suppose we want to construct a suffix tree from the string, s = “meet”.
The string has four suffixes: s(1) = “meet”, s(2) = “eet”, s(3) = “et”, and
s(4) = “t”.


We begin at the root of the tree and create a child node for the first
character of the suffix s(1). We then move down the tree to the newly created
node and create a new child for the next character in the suffix, repeating this
process for each of the characters in this suffix. We then take the next suffix,
s(2), and, starting at the root, repeat the process as with the previous suffix.
At any node, we only create a new child node if none of the existing children 
represents the character we are concerned with at that point. When we have 
entered each of the suffixes, the resulting tree looks like that in Figure 1. Each 
node is labelled with the character it represents and its frequency. The node’s 
position also represents the position of the character in the suffix, such that 
we can have several nodes labelled with the same character, but each child 
of each node (including the root) will carry a character label which is unique 
among its siblings. 
If we then enter the string, t = “feet”, into the tree in Figure 1, we
obtain the tree in Figure 2. The new tree is almost identical in structure to
the previous one because the suffixes of the two strings are all the same but
for t(1) = “feet”, and as we said before, we need only create a new node
when an appropriate node does not already exist; otherwise, we need only
increment the frequency count.
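
The construction procedure just described can be sketched as follows. This is a naive illustration under stated assumptions rather than an efficient algorithm: nodes (not edges) carry the character labels, no terminal character is used, and the names `Node`, `insert_string`, `find`, `count_nodes` and the optional `max_depth` parameter are hypothetical.

```python
class Node:
    """A node labelled with one character; freq counts insertions passing through it."""
    def __init__(self, char=None):
        self.char = char
        self.freq = 0
        self.children = {}  # character -> child Node, so sibling labels are unique

def insert_string(root, s, max_depth=None):
    """Insert every suffix of s, character by character, starting at the root."""
    for start in range(len(s)):
        suffix = s[start:] if max_depth is None else s[start:start + max_depth]
        node = root
        for ch in suffix:
            child = node.children.get(ch)
            if child is None:            # create a node only if none exists yet
                child = node.children[ch] = Node(ch)
            child.freq += 1              # otherwise just increment the frequency
            node = child

def find(root, path):
    """Return the node reached by following path from the root, or None."""
    node = root
    for ch in path:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def count_nodes(node):
    """Number of nodes below the given node (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())

root = Node()
insert_string(root, "meet")
insert_string(root, "feet")
```

With these two strings inserted, the node for “t” given “mee” has frequency 1 and the node for “t” given “ee” has frequency 2.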

Thus, as we continue to add more strings to the tree, the number of nodes 
in the tree increases only if the new string contains substrings which have 
not previously been encountered. It follows that given a fixed alphabet and a 
limit to the length of substrings we consider, there is a limit to the size of the 
tree. Practically, we would expect that, for most classes, as we continue to 
add strings to the class tree, the tree will increase in size at a decreasing rate, 
and will quite likely stabilise. 
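
The size limit can be made explicit. As a sketch, writing |Σ| for the number of letters in the alphabet and d for the depth limit (d is a symbol introduced here purely for illustration), a tree in which each node has at most |Σ| children cannot exceed:

```latex
|T| \;\le\; \sum_{k=1}^{d} |\Sigma|^{k} \;=\; \frac{|\Sigma|^{d+1} - |\Sigma|}{|\Sigma| - 1} \qquad (|\Sigma| > 1)
```

One would expect a tree built from natural language text to stay well below this bound, since most combinations of characters never occur.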

4.3 Class Trees and their Characteristics 

For any string s we designate the i-th character of s by s_i; the suffix of s
beginning at the i-th character by s(i); and the substring from the i-th to the j-th
character inclusively by s(i,j).

Any node, n, labelled with a character, c, is uniquely identified by the
path from the root to n. For example, consider the tree in Figure 2. There
are several nodes labelled with a “t”, but we can distinguish between node
n = (“t” given “mee”) = (t|mee) and p = (“t” given “ee”) = (t|ee); these
nodes are labelled n and p in Figure 2. We say that the path of n is
“mee”, and the path of p is “ee”; furthermore, the frequency of n is
1, whereas the frequency of p is 2; and saying n has a frequency of 1 is
equivalent to saying the frequency of “t” given “mee” is 1, and similarly for
p.

If we say that the root node, r, is at level zero in the tree, then all the
children of r are at level one. More generally, we can say that the level of
any node in the tree is one plus the number of letters in its path. For example,
level(n) = 4 and level(p) = 3.

The set of letters forming the first level of a tree is the alphabet, Σ,
meaning that all the nodes of the tree are labelled with one of these letters.
For example, considering again the tree in Figure 2, its first level letters are
the set Σ = {m, e, t, f}, and all the nodes of the tree are labelled by one of
these.

Suppose we consider a class, C, containing two strings (which we might
consider as documents), s = “meet” and t = “feet”. Then we can refer to the
tree in Figure 2 as the class tree of C, or the suffix tree profile of C, which we
denote by T_C.

The size of the tree, |T_C|, is the number of nodes it has, and it has as
many nodes as C has unique substrings. For instance, in the case of the tree
in Figure 2:

UC = uniqueSubstrings(C) = {meet, mee, me, m, eet, ee, e, et, t, feet, fee, fe, f}

|UC| = |uniqueSubstrings(C)| = 13
|T_C| = numberOfNodes(T_C) = 13

This is clearly not the same as the total number of substrings (tokens) in
C:

AC = allSubstrings(C) = {meet, mee, me, m, eet, ee, e, e, et, t, feet, fee, fe, f, eet, ee, e, e, et, t}

|AC| = |allSubstrings(C)| = 20

As an example, note that the four “e”s in the set are in fact the substrings
s(2,2), s(3,3), t(2,2) and t(3,3).

Furthermore, as each node in the tree, T_C, represents one of the substrings
in UC, the size of the class, |AC|, is equal to the sum of the frequencies of
nodes in the tree T_C:

|AC| = |allSubstrings(C)| = sumOfFrequencies(T_C) = 20

In a similar way, the suffix tree allows us to read off other frequencies 
very quickly and easily. For example, if we want to know the number of 
characters in the class C, we can sum the frequencies of the nodes on the first 
level of the tree; and if we want to know the number of substrings of length 
2, we can sum the frequencies of the level two nodes; and so on. 

This also allows us to very easily estimate probabilities of substrings of
any length (up to the depth of the tree), or of any nodes in the tree. For
example, we can say from the tree in Figure 2 that the probability of a substring, u,
of length two, having the value, u = “ee”, given the class C, is the frequency,
f, of the node n = (e|e), divided by the sum of the frequencies of all the level
two nodes in the tree T_C:

$$ \text{estimatedTotalProbability}(u) = \frac{f(u)}{\sum_{i \in N_u} f(i)} \tag{5} $$

where N_u is the set of all nodes at the same level as u.

Similarly one can estimate the conditional probability of u as the frequency
of u divided by the sum of the frequencies of all the children of u’s parent:

$$ \text{estimatedConditionalProbability}(u) = \frac{f(u)}{\sum_{i \in n_u} f(i)} \tag{6} $$

where n_u is the set of all children of u’s parent.

Throughout this paper, whenever we mention p(u), we mean the second
of these (formula 6): the conditional probability of a node u.
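
Both estimates can be illustrated with a short sketch. As a simplifying assumption, the class tree is represented here by a table of substring frequencies, which carries the same information because the frequency of a node equals the number of occurrences of the substring formed by its path and its label. The function names are hypothetical.

```python
from collections import Counter

def substring_counts(strings, max_depth=None):
    """Frequency of every substring; equals the frequency of the matching tree node."""
    counts = Counter()
    for s in strings:
        for i in range(len(s)):
            end = len(s) if max_depth is None else min(len(s), i + max_depth)
            for j in range(i + 1, end + 1):
                counts[s[i:j]] += 1
    return counts

def total_probability(counts, u):
    """f(u) divided by the summed frequency of all nodes on the same level as u."""
    level_sum = sum(f for sub, f in counts.items() if len(sub) == len(u))
    return counts[u] / level_sum

def conditional_probability(counts, u):
    """f(u) divided by the summed frequency of all children of u's parent."""
    parent = u[:-1]
    sibling_sum = sum(f for sub, f in counts.items()
                      if len(sub) == len(u) and sub.startswith(parent))
    return counts[u] / sibling_sum

counts = substring_counts(["meet", "feet"])
print(total_probability(counts, "ee"))        # 2 of the 6 level-two occurrences
print(conditional_probability(counts, "ee"))  # 2 of the 4 occurrences below "e"
```

For the class containing “meet” and “feet”, the level-two nodes are me, ee, et and fe with frequencies 1, 2, 2 and 1, so the total probability of “ee” is 2/6, while its conditional probability is 2/4 because the node “e” has only the children “e” and “t”.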

4.4 Classification using Suffix Trees 

Researchers have tackled the problem of the construction of a text classifier 
in a variety of different ways, but it is popular to approach the problem as 
one that consists of two parts: 

1. The definition of a function, CS_i : D → R, where D is the set of all
documents; such that, given a particular document, d, the function returns
a category score for the class i. The score is often normalised to ensure
that it falls in the region [0,1], but this is not strictly necessary,
particularly if one intends, as we do, simply to take as the class prediction
the highest scoring class (see Part 2 below). The interpretation of the
meaning of the function, CS, depends on the approach adopted. For
example, as we have seen, in naive Bayes, CS(d) is interpreted as a
probability, whereas in other approaches such as Rocchio [23], CS(d)
is interpreted as a distance or similarity measure between two vectors.

2. A decision mechanism which determines a class prediction from a set
of class scores. For example, the highest scoring class might be taken
as the predicted class: PC = argmax_{c_j ∈ C} {CS_j(d)}. Alternatively, if
CS(d) is interpreted as a value with definite range, such as a
probability, the decision may be based on a threshold, th, such that the predicted
class is taken as c_j if CS_j(d) > th, and as not c_j otherwise.
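
The two decision mechanisms in Part 2 can be sketched as below. The class scores are assumed to have been computed already and are passed in as a dictionary; the function names and the example scores are hypothetical.

```python
def predict_argmax(scores):
    """scores maps a class name to its category score; return the highest scoring class."""
    return max(scores, key=scores.get)

def predict_threshold(scores, target, th):
    """True if the document is assigned to the target class, i.e. its score exceeds th."""
    return scores[target] > th

example_scores = {"ham": 0.2, "spam": 0.7}
```

The first mechanism needs no particular range for the scores, whereas the second is only meaningful when the scores have a definite range against which th can be set.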

111161 refer to probabilistic models such as naive Bayes as parametric 
classifiers because they attempt to use the training data to estimate the
parameters of a probability distribution, and assume that the estimated
distribution is correct. Non-parametric, geometric models, such as Rocchio [23],
instead attempt to produce a profile or summary of the training data and use 
this profile to query new documents to decide their class. 

It is possible to approach the construction of a suffix tree classifier in
either of these two ways and indeed a probability-based approach has been
developed by for use in gene sequence matching. However, 0| did not find
the suffix tree entirely convenient for developing a probabilistic framework 
and instead developed a probabilistic analogue to the suffix tree and used this 
modified data structure to develop probabilistic matching algorithms. 

In this paper, we retain the original structure of the suffix tree and favour 
a non-parametric, or geometric, approach to classifier construction. In such 
a framework a match between a document and a suffix tree profile of a class 
is a set of coinciding substrings each of which must be scored individually
so that the total score is the sum of individual scores. This is analogous to
the inner product between a document vector and class profile vector in the
Rocchio algorithm [23]. We did experiment with probabilistic models, and
found that it was possible to construct one without altering the structure of
the suffix tree (indeed, some of the flavours of the scoring system we present
can be seen as approximating a probabilistic approach; see Section 4.4.1,
Part 1(c)) even though the branches are not independent: each corresponds to
a set of strings which may overlap. However, we found that additive scoring
algorithms performed better and in the current paper we describe only this
approach. The method, the details of which are presented in the next section,
is governed by two heuristics:

H1 Each substring s(i) that a string s has in common with a class T
indicates a degree of similarity between s and T, and the longer the
common substrings the greater the similarity they indicate.

H2 The more diverse* a class T, the less significant is the existence of a
particular common substring s(i,j) between a string s and the class T.

Turning to the second issue in classifier construction, for our current
two-class problem, we take the ratio of two scores, hsr = hamScore / spamScore, just as we did
in the case of our naive Bayesian classifier, and classify the document as ham
if the ratio is greater than a threshold, th, and as spam if the ratio is below th.
By raising and lowering this threshold we can change the relative importance
we place on misclassified spam and ham messages (see Section 5.1.3).

4.4.1 Scoring 

The suffix tree representation of a class is richer than the vector
representation of more traditional approaches and in developing a scoring method we
can experiment with various properties of the tree, each of which can be seen 
as reflecting certain properties of the class and its members. 

We begin by describing how to score a match between a string and a 
class, then extend this to encompass document scoring. Conceptually
dividing the scoring in this way allowed us to introduce and experiment with two
levels of normalisation: match-level, reflecting information about strings; 
and tree-level, reflecting information about the class as a whole. The end of 
this section elaborates on the underlying motivation for the described scoring 
method. 

1. Scoring a match 

(a) We define a match as follows: A string s has a match m = m(s,T)
in a tree T if there exists in T a path from the root equal to m, where m is a prefix
of s.

^Diversity is here an intuitive notion which the scoring method attempts to define and represent 
in a number of different ways. 
Clearly, the match m may represent several substrings that are
common between s and T. However, it is important to note that if
|m| > 1 and m_0 = s_i, then we would expect to find another match
m' beginning at s_{i+1} such that |m'| ≥ |m| - 1; hence we can think
of m as representing only those substrings common to both s and
T which begin with m_0 and still be sure that the set of all matches,
M, between s and T will represent each common substring.

(b) The score, score(m), for a match m = m_0 m_1 m_2 ... m_n, has two
parts: firstly, the scoring of each character (and thereby, each
substring), m_i, with respect to its conditional probability, using a
significance function of probability, φ[p] (defined below in part 1(c));
and secondly, the adjustment (normalisation), v(m|T), of the score
for the whole match with respect to its probability in the tree:

$$ score(m) = v(m \mid T) \sum_{i=0}^{n} \phi[p(m_i)] \tag{7} $$

Using the conditional probability rather than the total
probability has the benefit of supporting heuristic H1: as we go deeper
down the tree, each node will tend to have fewer children and so
the conditional probability will be likely to increase; conversely,
there will generally be an increasing number of nodes at each level
and so the total probability of a particular node will decrease.
Indeed, we did experiment with the total probability and found that
performance was significantly decreased.

Furthermore, by using the conditional probability we also only 
consider the independent parts of features when deriving scores. 
So for example, if m = “abc”, by the time we are scoring the 
feature represented by “abc”, we have already scored the feature 
“ab”, so we need only score “c” given “ab”. 

(c) A function of probability, φ[p], is employed as a significance
function because it is not always the most frequently occurring
terms or strings which are most indicative of a class. For
example, this is the reason that conventional pre-processing removes
all stop words, and the most and least frequently occurring terms;
however, by removing them completely we give them no
significance at all, when we might instead include them, but reduce
their significance in the classification decision. Functions on the
probability can help to do this, especially in the absence of all
pre-processing, but that still leaves the question of how to weight
the probabilities, the answer to which will depend on the class.

In the spam domain, some strings will occur very infrequently
(consider some of the strings resulting from intra-word characters
in the examples of spam in Section 2 above) in either the spam
or ham classes, and it is because they are so infrequent that they
are indicative of spam. Therefore, under such an argument, rather
than remove such terms or strings, we should actually increase
their weighting.

Considerations such as these led to experimentation with a
number of specifications of the significance function, φ[p]:

$$ \phi[p] = \begin{cases}
1 & \text{constant} \\
p & \text{linear} \\
p^2 & \text{square} \\
\sqrt{p} & \text{root} \\
\ln(p) - \ln(1 - p) & \text{logit} \\
\dfrac{1}{1 + \exp(-p)} & \text{sigmoid}
\end{cases} $$

The first three functions after the constant are variations of the
linear (linear, sub-linear and super-linear). The last two are
variations on the S-curve; we give above the simplest forms of the
functions, but in fact, they must be adjusted to fit in the range
[0,1].

Although in this paper we are not aiming to develop a
probabilistic scoring method, note that the logit significance
function applied to formula 7 may be considered an approximation
of such an approach since we generally have a large alphabet and
therefore a large number of children at each node, and so for most
practical purposes ln(1 - p) ≈ 0.

(d) Turning our attention to match-level normalisation, we
experimented with three specifications of v(m|T):

$$ v(m \mid T) = \begin{cases}
1 & \text{match unnormalised} \\
\dfrac{f(m \mid T)}{\sum_{i \in (m^* \mid T)} f(i)} & \text{match permutation normalised} \\
\dfrac{f(m \mid T)}{\sum_{i \in (m' \mid T)} f(i)} & \text{match length normalised}
\end{cases} $$

where m* is the set of all the strings in T formed by the
permutations of the letters in m; and m' is the set of all strings in T of
length equal to the length of m.

Match permutation normalisation (MPN) is motivated by heuristic H2.
The more diverse a class (meaning that it is represented by a relatively
large set of substring features), the more combinations of characters
we would expect to find, and so finding the particular match m is less
significant than if the class were very narrow (meaning that it is fully
represented by a relatively small set of substring features). Reflecting
this, the MPN parameter will tend towards 1 if the class is less diverse
and towards 0 if the class is more diverse.
Match length normalisation (MLN) is motivated by examples from
standard linear classifiers (see 113 for an overview), where length
normalisation of feature weights is not uncommon. However, MLN
actually runs counter to heuristic H1 because it will tend towards 0 as
the match length increases. We would therefore expect MLN to reduce
the performance of the classifier; thus MLN may serve as a test of the
intuitions governing heuristic H1.

2. Scoring a document 

(a) To score an entire document we consider each suffix of the 
document in turn and score any match between that suffix and the class 
tree. Thus the score for a document s is the sum: 

$$SCORE(s, T) = \sum_{i=0}^{n} score(s(i), T) \qquad (8)$$

where score(s(i), T) searches for a match, m, between suffix 
s(i) and tree T, and if one is found, scores it according to 
formula 0. 

We experimented with a number of approaches to tree-level 
normalisation of the sum in (8), motivated again by heuristic H2 
and based on tree properties such as size, as a direct reflection of 
the diversity of the class; density (defined as the average number 
of children over all internal nodes), as an implicit reflection of the 
diversity; and total and average frequencies of nodes as an 
indication of the size of the class^; but found none of our attempts to 
be generally helpful to the performance of the classifier. 
Unfortunately, we do not have space in this paper to discuss this 
aspect further. 
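The two match-level normalisations described in step 1(d) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the class tree T is summarised as a mapping from each substring to its frequency; the names `match_permutation_norm`, `match_length_norm` and `freq` are hypothetical and do not come from the paper.

```python
from collections import Counter


def match_permutation_norm(m, freq):
    """f(m|T) divided by the total frequency of all strings in T
    that are permutations of the letters of m (m itself included)."""
    key = sorted(m)
    denom = sum(f for s, f in freq.items() if sorted(s) == key)
    return freq.get(m, 0) / denom if denom else 0.0


def match_length_norm(m, freq):
    """f(m|T) divided by the total frequency of all strings in T
    that have the same length as m."""
    denom = sum(f for s, f in freq.items() if len(s) == len(m))
    return freq.get(m, 0) / denom if denom else 0.0


# Toy class: "ab" seen 3 times, its permutation "ba" once, "cd" 4 times.
freq = Counter({"ab": 3, "ba": 1, "cd": 4})
print(match_permutation_norm("ab", freq))  # 0.75
print(match_length_norm("ab", freq))       # 0.375
```

In this toy class, MPN stays close to 1 because few permutations of the match occur, whereas MLN is pulled down by every other string of the same length.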

The underlying mechanism of the scoring function can be grasped by 
considering its simplest configuration: using the constant significance 
function, with no normalisation. If the scoring method were used in this 
form to score the similarity between two strings, it would simply count the 
number of substrings that the two strings have in common. For example, 
suppose we have a string t = “abcd”. If we were to apply this scoring 
function to assessing the similarity that t has with itself, we would obtain 
a result of 10, because this is the number of unique substrings that exist 
in t. If we then score the similarity between t and t' = “Xbcd”, we obtain a 
score of 6, because the two strings share 6 unique substrings; similarly, a 
string t'' = “aXcd” would score 4. 

Another way of viewing this is to think of each substring of t as 
representing a feature in the class that t represents. The scoring method 
then weights each of these as 1 if they are present in a query string and 0 
otherwise, in a way that is analogous to the simplest form of weighting in 
algorithms such as Rocchio. 

^ Class size is defined as the total number of substrings in the documents of the class, and tree 
size as the number of nodes in the tree, that is, the number of unique substrings in the class (see 
Section lOl). 

Once seen in this way, we can consider all other flavours of the 
classifier as experimenting with different approaches to deciding how 
significant each common substring is, or in other words, deciding how to 
weight each class feature - in much the same way as other non-parametric 
classifier algorithms. 
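The simplest configuration described above (constant significance, no normalisation) can be sketched with a depth-limited character trie standing in for the suffix tree. This is an illustrative simplification rather than the authors' implementation; `build_trie`, `match_length` and `score` are hypothetical names. Each suffix of the query is matched against the trie and contributes one point per matched character.

```python
def build_trie(text, depth):
    """Insert every substring of `text` of length <= depth into a nested-dict trie."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:i + depth]:
            node = node.setdefault(ch, {})
    return root


def match_length(suffix, trie):
    """Length of the longest prefix of `suffix` that is a path from the root."""
    node, n = trie, 0
    for ch in suffix:
        if ch not in node:
            break
        node = node[ch]
        n += 1
    return n


def score(query, trie):
    """Sum, over all suffixes of the query, of the matched length
    (constant significance of 1 per character, no normalisation)."""
    return sum(match_length(query[i:], trie) for i in range(len(query)))


trie = build_trie("abcd", depth=8)
print(score("Xbcd", trie), score("aXcd", trie))  # 6 4
```

For the two query strings in the example, the sketch reproduces the scores of 6 and 4. Other significance functions would replace the constant contribution of 1 per matched character with a weight that depends on the node.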

5 Experimental Setup 

All experiments were conducted under ten-fold cross validation. We accept 
the point made by 1^ that such a method does not reflect the way classifiers 
are used in practice, but the method is widely used and serves as a thorough 
initial test of new approaches. 
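As a reminder of what ten-fold cross validation involves, the following minimal sketch partitions message indices into ten folds and uses each fold once for testing. The shuffling seed and the absence of per-class stratification are assumptions made here for illustration; the paper does not state how its folds were drawn, and `k_fold_splits` is a hypothetical name.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for train, test in k_fold_splits(800):
    pass  # train a classifier on `train`, evaluate it on `test`
```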

We follow convention by considering as true positives (TP) spam 
messages which are correctly classified as spam; false positives (FP) are 
then ham messages which are incorrectly classified as spam; false negatives 
(FN) are spam incorrectly classified as ham; true negatives (TN) are ham 
messages correctly classified as ham. See Section 5.3 for more on the 
performance measurements we use. 

5.1 Experimental Parameters 

5.1.1 Spam to Ham Ratios 

From some initial tests we found that success was to some extent contingent 
on the proportion of spam to ham in our data set - a point which is identified, 
but not systematically investigated, in other work [20] - and this therefore 
became part of our investigation. The differing results further prompted us to 
introduce forms of normalisation, even though we had initially expected the 
probabilities to take care of differences in the scale and mix of the data. Our 
experiments used three different ratios of spam to ham: 1:1, 4:6 and 1:5. The 
first and second of these (1:1 and 4:6) were chosen to reflect some of the 
estimates made in the literature of the actual proportions of spam in current 
global email traffic. The last of these (1:5) was chosen as the minimum 
proportion of spam included in experiments detailed in the literature, for 
example in [2]. 

5.1.2 Tree Depth 

It is too computationally expensive to build trees as deep as emails are long. 
Furthermore, the marginal performance gain from increasing the depth of a 
tree, and therefore the length of the substrings we consider, may be negative. 
Certainly, our experiments show a diminishing marginal improvement (see 
Section 6.2.1), which would suggest a maximal performance level, which 
may not have been reached by any of our trials. We experimented with depths 
of 2, 4, 6, and 8. 

5.1.3 Threshold 

From initial trials, we observed that the choice of threshold value in the 
classification criterion can have a significant, and even critical, effect on 
performance, and so introduced it as an important experimental parameter. We 
used a range of threshold values between 0.7 and 1.3, with increments of 0.1, 
with a view to probing the behaviour of the scoring system. 

Varying the threshold is equivalent to associating higher costs with either 
false positives or false negatives because checking that $(\alpha/\beta) > t$ 
is equivalent to checking that $\alpha > t\beta$. 

5.2 Data 

Three corpora were used to create the training and testing sets: 

1. The Ling-Spam corpus (LS) 

This is available from: http://www.aueb.gr/users/ion/data/lingspam_public.tar.gz 

The corpus is that used in [2]. The spam messages of the corpus were 
collected by the authors from emails they received. The ham messages 
are taken from postings on a public online linguist bulletin board for 
professionals; the list was moderated, so does not contain any spam. 
Such a source may at first seem biased, but the authors claim that this 
is not the case. There are a total of 481 spam messages and 2412 ham 
messages, with each message consisting of a subject and body. 

When comparing our results against those of [2] in Section 6.1 we 
use the complete data set, but in further experiments, where our aim 
was to probe the properties of the suffix tree approach and investigate 
the effect of different proportions of spam to ham messages, we use 
a random subset of the messages so that the sizes and ratios of the 
experimental data sets derived from this source are the same as data 
sets made up of messages from other sources (see Table 1 below). 

2. Spam Assassin public corpus (SA) 

This is available from: http://spamassassin.org/publiccorpus 
The corpus was collected from direct donations and from public forums 
over two periods in 2002 and 2003, of which we use only the later. The 
set from 2003 comprises a total of 6053 messages, approximately 31% 
of which are spam. The ham messages are split into ’easy ham’ (SAe) 
and ’hard ham’ (SAh), the former being again split into two groups 
(SAe-G1 and SAe-G2); the spam is similarly split into two groups 
(SAs-G1 and SAs-G2), but there is no distinction between hard and 
easy. The compilers of the corpus describe hard ham as being closer in 
many respects to typical spam: use of HTML, unusual HTML markup, 
coloured text, “spammish-sounding” phrases etc. 

In our experiments we use ham from the hard group and the second 
easy group (SAe-G2); for spam we use only examples from the 
second group (SAs-G2). Of the hard ham there are only 251 emails, but 
for some of our experiments we required more examples, so whenever 
necessary we padded out the set with randomly selected examples from 
group G2 of the easy ham (SAe-G2); see Table 1. The SA corpus 
reproduces all header information in full, but for our purposes, we extracted 
the subjects and bodies of each; the versions we used are available at: 
http://dcs.bbk.ac.uk/~rajesh/spamcorpora/spamassassin03.zip 

3. The BBKSpam04 corpus (BKS) 

This is available at: http://dcs.bbk.ac.uk/~rajesh/spamcorpora/bbkspam04.zip 
This corpus consists of the subjects and bodies of 600 spam messages 
received by the authors during 2004. The Birkbeck School of Computer 
Science and Information Systems uses an installation of the 
SpamAssassin filter O with default settings, so all the spam messages in 
this corpus have initially evaded that filter. The corpus is further 
filtered so that no two emails share more than half their substrings with 
others in the corpus. Almost all the messages in this collection contain 
some kind of obfuscation, and so more accurately reflect the current 
level of evolution in spam. 

Each experimental email data set (EDS) consists of a set of spam and a 
set of ham. Using messages from these three corpora, we created the EDSs 
shown in Table 1. The final two numbers in the code for each email data 
set indicate the mix of spam to ham; three mixes were used: 1:1, 4:6, and 
1:5. The letters at the start of the code indicate the source corpus of the set’s 
spam and ham, respectively; hence the grouping. For example, EDS SAe-46 
consists of 400 spam mails taken from the group SAs-G2 and 600 ham 
mails from the group SAe-G2, and EDS BKS-SAeh-15 consists of 200 
spam mails from the BKS data set and 1000 ham mails made up of 800 mails 
from the SAe-G2 group and 200 mails from the SAh group. 
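The way an EDS is assembled from a ratio code can be sketched as follows. The counts per ratio code are those listed in Table 1; drawing the messages uniformly at random with a fixed seed is an assumption for illustration, and `RATIO_SIZES` and `make_eds` are hypothetical names.

```python
import random

# (number of spam, number of ham) for each ratio code, as listed in Table 1.
RATIO_SIZES = {"11": (400, 400), "46": (400, 600), "15": (200, 1000)}


def make_eds(spam_pool, ham_pool, ratio_code, seed=0):
    """Draw a random spam set and ham set of the sizes implied by the ratio code."""
    n_spam, n_ham = RATIO_SIZES[ratio_code]
    rng = random.Random(seed)
    return rng.sample(spam_pool, n_spam), rng.sample(ham_pool, n_ham)


# Placeholder message identifiers stand in for real corpus messages.
spam_pool = [f"spam-{i}" for i in range(600)]
ham_pool = [f"ham-{i}" for i in range(2412)]
spam, ham = make_eds(spam_pool, ham_pool, "15")
print(len(spam), len(ham))  # 200 1000
```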

Table 1: Composition of Email Data Sets (EDSs) used in the experiments. 

| EDS Code    | Spam source (number from source) | Ham source (number from source) |
|-------------|----------------------------------|---------------------------------|
| LS-FULL     | LS (481)                         | LS (2412)                       |
| LS-11       | LS (400)                         | LS (400)                        |
| LS-46       | LS (400)                         | LS (600)                        |
| LS-15       | LS (200)                         | LS (1000)                       |
| SAe-11      | SAs-G2 (400)                     | SAe-G2 (400)                    |
| SAe-46      | SAs-G2 (400)                     | SAe-G2 (600)                    |
| SAe-15      | SAs-G2 (200)                     | SAe-G2 (1000)                   |
| SAeh-11     | SAs-G2 (400)                     | SAe-G2 (200) + SAh (200)        |
| SAeh-46     | SAs-G2 (400)                     | SAe-G2 (400) + SAh (200)        |
| SAeh-15     | SAs-G2 (200)                     | SAe-G2 (800) + SAh (200)        |
| BKS-LS-11   | BKS (400)                        | LS (400)                        |
| BKS-LS-46   | BKS (400)                        | LS (600)                        |
| BKS-LS-15   | BKS (200)                        | LS (1000)                       |
| BKS-SAe-11  | BKS (400)                        | SAe-G2 (400)                    |
| BKS-SAe-46  | BKS (400)                        | SAe-G2 (600)                    |
| BKS-SAe-15  | BKS (200)                        | SAe-G2 (1000)                   |
| BKS-SAeh-11 | BKS (400)                        | SAe-G2 (200) + SAh (200)        |
| BKS-SAeh-46 | BKS (400)                        | SAe-G2 (400) + SAh (200)        |
| BKS-SAeh-15 | BKS (200)                        | SAe-G2 (800) + SAh (200)        |

5.2.1 Pre-processing 

For the suffix tree classifier, no pre-processing is done. It is likely that some 
pre-processing of the data may improve the performance of an ST classifier, 
but we do not address this issue in the current paper. 

For the naive Bayesian classifier, we use the following standard three 
pre-processing procedures: 

1. Remove all punctuation. 

2. Remove all stop-words. 

3. Stem all remaining words. 

Words are taken as strings of characters separated from other strings by one 
or more whitespace characters (spaces, tabs, newlines). Punctuation is 
removed first in the hope that many of the intra-word characters which 
spammers use to confuse a Bayesian filter will be removed. Our stop-word 
list consisted of 57 of the most frequent prepositions, pronouns, articles 
and conjunctions. Stemming was done using an implementation of Porter’s 
1980 algorithm, more recently reprinted in ca. All words less than three 
characters long are ignored. For more general information on these and 
other approaches to pre-processing, the reader is directed to [18, 31]. 
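The pre-processing steps for the naive Bayesian classifier can be sketched as a small pipeline. Several details are assumptions made for illustration only: the stop-word set below is a small hypothetical stand-in for the 57-word list, lower-casing is added, the short-word filter is applied last, and the stemmer is passed in as a function (identity by default) rather than implementing Porter's algorithm.

```python
import string

# Hypothetical stand-in for the paper's 57-word stop list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "you", "for"}


def preprocess(text, stem=lambda word: word):
    """Return the word features of `text` after the three pre-processing steps."""
    # 1. Remove punctuation first, so that "V.i.a.g.r.a" collapses to "Viagra".
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Words are whitespace-separated strings (lower-casing is an assumption).
    words = no_punct.lower().split()
    # 2. Remove stop-words, then 3. stem what remains.
    stemmed = [stem(w) for w in words if w not in STOP_WORDS]
    # Words shorter than three characters are ignored.
    return [w for w in stemmed if len(w) >= 3]


print(preprocess("V.i.a.g.r.a is the best, buy it now!"))
# ['viagra', 'best', 'buy', 'now']
```

A real Porter stemmer could be supplied through the `stem` argument without changing the rest of the pipeline.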


5.3 Performance Measurement 

There are generally two sets of measures used in the literature; here we 
introduce both in order that our results may be more easily compared with 
previous work. 

Following 1^, [2], and others, the first set of measurement parameters 
we use are recall and precision for both spam and ham. For spam (and 
similarly for ham) these measurements are defined as follows: 

$$\text{Spam Recall } (SR) = \frac{SS}{SS + SH}, \qquad \text{Spam Precision } (SP) = \frac{SS}{SS + HS}$$

where XY means the number of items of class X assigned to class Y; with S 
standing for spam and H for ham. Spam recall measures the proportion of all 
spam messages which were identified as spam and spam precision measures 
the proportion of all messages classified as spam which truly are spam; and 
similarly for ham. 

However, it is now more popular to measure performance in terms of true 
positive (TP) and false positive (FP) rates: 

$$TPR = \frac{SS}{SS + SH}, \qquad FPR = \frac{HS}{HH + HS}$$

The TPR is then the proportion of spam correctly classified as spam and the 
FPR is the proportion of ham incorrectly classified as spam. Using these 
measures, we plot in Section 6 what are generally referred to as receiver 
operating characteristic (ROC) curves [7] to observe the behaviour of the 
classifier at a range of thresholds. 

To see performance rates precisely for particular thresholds, we also found 
it useful to plot, against threshold, false positive rates (FPR) and false 
negative rates (FNR): 

$$FNR = 1 - TPR$$


Effectively, FPR measures errors in the classification of ham and FNR 
measures errors in the classification of spam. 
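All of the measures in this section follow from the four counts SS, SH, HS and HH, where XY is the number of items of class X assigned to class Y. The following minimal sketch computes them; the function name `rates` and the example counts are illustrative only.

```python
def rates(ss, sh, hs, hh):
    """Performance measures from the confusion counts.

    ss: spam classified as spam     sh: spam classified as ham
    hs: ham classified as spam      hh: ham classified as ham
    """
    return {
        "SR": ss / (ss + sh),    # spam recall
        "SP": ss / (ss + hs),    # spam precision
        "TPR": ss / (ss + sh),   # true positive rate (identical to spam recall)
        "FPR": hs / (hh + hs),   # false positive rate
        "FNR": sh / (ss + sh),   # false negative rate, equal to 1 - TPR
    }


r = rates(ss=90, sh=10, hs=5, hh=95)
print(r["SR"], r["FPR"], r["FNR"])  # 0.9 0.05 0.1
```

Note that spam recall and TPR are the same quantity, so the two sets of measures differ only in pairing recall with precision or with FPR.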
Table 2: Results of (Androutsopoulos et al., 2000) on the Ling-Spam 
corpus. In the pre-processing column: ’bare’ indicates no 
pre-processing. The column labelled ’No. of attrib.’ indicates the number 
of word features which the authors retained as indicators of class. 
Results are shown at the bottom of the table from ST classification using 
a linear significance function and no normalisation; for the ST 
classifier, we performed no pre-processing and no feature selection. 

| Classifier | Pre-processing             | No. of attrib. | th    | SR (%) | SP (%) |
|------------|----------------------------|----------------|-------|--------|--------|
| NB         | (a) bare                   | 50             | 1.0   | 81.10  | 96.85  |
|            | (b) stop-list              | 50             | 1.0   | 82.35  | 97.13  |
|            | (c) lemmatizer             | 100            | 1.0   | 82.35  | 99.02  |
|            | (d) lemmatizer + stop-list | 100            | 1.0   | 82.78  | 99.49  |
|            | (a) bare                   | 200            | 0.11  | 76.94  | 99.46  |
|            | (b) stop-list              | 200            | 0.11  | 76.11  | 99.47  |
|            | (c) lemmatizer             | 100            | 0.11  | 77.57  | 99.45  |
|            | (d) lemmatizer + stop-list | 100            | 0.11  | 78.41  | 99.47  |
|            | (a) bare                   | 200            | 0.001 | 73.82  | 99.43  |
|            | (b) stop-list              | 200            | 0.001 | 73.40  | 99.43  |
|            | (c) lemmatizer             | 300            | 0.001 | 63.67  | 100.00 |
|            | (d) lemmatizer + stop-list | 300            | 0.001 | 63.05  | 100.00 |
| ST         | bare                       | N/A            | 1.00  | 97.50  | 99.79  |
|            | bare                       | N/A            | 0.98  | 96.04  | 100.00 |



6 Results 

We begin in Section 6.1 by comparing the results of the suffix tree (ST) 
approach to the reported results for a naive Bayesian (NB) classifier on 
the Ling-Spam corpus. We then extend the investigation of the suffix tree 
to other data sets to examine its behaviour under different conditions and 
configurations. To maintain a comparative element on the further data sets 
we implemented an NB classifier which proved to be competitive with the 
classifier performance as reported in [2] and others. In this way we look at 
each experimental parameter in turn and its effect on the performance of the 
classifier under various configurations. 
Table 3: Results of in-house naive Bayes on the LS-FULL data set, 
with stop-words removed and all remaining words lemmatized. The 
number of attributes was unlimited, but, for the LS-FULL data set, in 
practice the spam vocabulary was approximately 12,000, and the ham 
vocabulary approximately 56,000, with 7,000 words appearing in both 
classes. 

| Classifier | Pre-processing         | No. of attrib. | th   | SR (%) | SP (%) |
|------------|------------------------|----------------|------|--------|--------|
| NB*        | lemmatizer + stop-list | unlimited      | 1.0  | 99.16  | 97.14  |
|            | lemmatizer + stop-list | unlimited      | 0.94 | 89.58  | 100.00 |



6.1 Assessment 

Table 2 shows the results reported in [2], from the application of their NB 
classifier on the LS-FULL data set, and the results of the ST classifier, using 
a linear significance function with no normalisation, on the same data set. 

As can be seen, the performance levels for precision are comparable, but 
the suffix tree simultaneously achieves much better results for recall. 

The authors of [2] tested a number of thresholds^ (th) and found that their NB 
filter achieves 100% spam precision (SP) at a threshold of 0.001. We 
similarly tried a number of thresholds for the ST classifier, as previously 
explained (see Section 5.1.3), and found that 100% SP was achieved at a 
threshold of 0.98. Achieving high SP comes at the inevitable cost of a lower 
spam recall (SR), but we found that our ST can achieve 100% SP with less cost 
in terms of SR, as can be seen in the table. 

As stated in the table (and previously: see Section 5.2.1), we did no 
pre-processing and no feature selection for the suffix tree. However, both of 
these may well improve performance, and we intend to investigate this in 
future work. 

As we mentioned earlier (and in Section 0), we use our own NB 
classifier in our further investigation of the performance of our ST 
classifier. We therefore begin by presenting in Table 3 the results of this 
classifier (NB*) on the LS-FULL data set. As the table shows, we found our 
results were, at least in some cases, better than those reported in [2]. This 
is an interesting result which we do not have space to investigate fully in 
this paper, but there are a number of differences in our naive Bayes method 
which may account for this. 

^ (Androutsopoulos et al., 2000) do not actually quote the threshold, but a ’cost value’, which we 
have converted into its threshold equivalent. 

Table 4: Precision-recall breakeven points on the LS-FULL data set. 

| Classifier | Spam (%) | Ham (%) |
|------------|----------|---------|
| NB'        | 96.47    | 99.34   |
| NB*        | 94.96    | 98.82   |
| ST         | 98.75    | 99.75   |

Firstly, [2] uses a maximum of 300 attributes, which may not have been 
enough for this domain or data set, whereas we go to the other extreme of 
not limiting our number of attributes, which would normally be expected to 
ultimately reduce performance, but only against an optimal number, which 
is not necessarily the number used by [2]. Indeed, some researchers o 
have found NB does not always benefit from feature limitation, while 
others have found the optimal number of features to be in the thousands or 
tens of thousands [25, 19]. Secondly, there may be significant differences in 
our pre-processing, such as a more effective stop-word list and removal of 
punctuation; and thirdly, we estimate the probability of word features using 
Laplace smoothing (see formula^}), which is more robust than the estimated 
probability quoted by [2]. 
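For readers unfamiliar with Laplace smoothing, the following sketch shows the textbook add-one form for word probabilities within one class. The paper's own formula is not reproduced in this section, so this should be read as the standard technique under that assumption rather than as the authors' exact estimator; `laplace_probs` is a hypothetical name.

```python
from collections import Counter


def laplace_probs(word_counts, vocabulary):
    """Add-one estimate P(w | class) = (count(w) + 1) / (total + |V|)."""
    total = sum(word_counts[w] for w in vocabulary)
    denom = total + len(vocabulary)
    return {w: (word_counts[w] + 1) / denom for w in vocabulary}


# "meeting" never occurs in this class, yet still receives a non-zero probability.
counts = Counter({"free": 2})
print(laplace_probs(counts, ["free", "meeting"]))
# {'free': 0.75, 'meeting': 0.25}
```

The robustness mentioned in the text comes from the fact that an unseen word no longer forces the product of word probabilities for a class to zero.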

There may indeed be further reasons, but it is not our intention in this 
paper to analyse the NB approach to text classification, but only to use it as a 
comparative aid in our investigation of the performance of the ST approach 
under various conditions. Indeed, other researchers have extensively 
investigated NB and for us to conduct the same depth of investigation would 
require a dedicated paper. 

Furthermore, both our NB* and ST classifiers appear to be competitive 
with quoted results from other approaches using the same data set. For 
example in Ea, the author experiments on the Ling-Spam data set with different 
models of NB and different methods of feature selection, and achieves results 
approximately similar to ours. (2^ quotes “breakeven” points, defined as the 
“highest recall for which recall equaled precision”, for both spam and ham; 
Table 4 shows the results achieved by the author’s best performing naive 
Bayes configuration (which we label as ‘NB'’) alongside our naive Bayes 
(NB*) and the suffix tree (ST) using a linear significance function and no 
normalisation. As can be seen, NB* achieves slightly worse results than 
NB', while ST achieves slightly better results; but all are clearly 
competitive. And as a final example, in w\ the author applies developments and 
extensions of support vector machine algorithms 1^ to the Ling-Spam data 
set, albeit in a different experimental context, and achieves a minimum sum 
of errors of 6.42%, which is slightly worse than the results achieved by our 
NB* and ST classifiers. 

Thus, let us proceed on the assumption that both our classifiers (NB* and 
ST) are at least competitive enough for the task at hand: to investigate how 
their performance varies under experimental conditions for which results are 
not available in the literature. 

Table 5: Classification errors by depth using a constant significance 
function, with no normalisation, and a threshold of 1 on the LS-11 email 
data set. 

| Depth | FPR (%) | FNR (%) |
|-------|---------|---------|
| 2     | 58.75   | 11.75   |
| 4     | 0.25    | 4.00    |
| 6     | 0.50    | 2.50    |
| 8     | 0.75    | 1.50    |



6.2 Analysis 

In the following tables, we group email data sets (EDSs), as in Table 1 
(Section 5.2), by their source corpora, so that the EDSs in one group differ 
from each other only in the proportion of spam to ham they contain. 

6.2.1 Effect of Depth Variation 

For illustrative purposes, Table 5 shows the results using the constant 
significance function, with no normalisation, on the LS-11 data set. Depths 
of 2, 4, 6, and 8 are shown. 

The table demonstrates a characteristic which is common to all considered 
combinations of significance and normalisation functions: performance 
improves as the depth increases. Therefore, in further examples, we 
consider only our maximum depth of 8. Notice also the decreasing marginal 
improvement as depth increases, which suggests that there may exist a 
maximal performance level, which was not necessarily achieved by our trials. 

6.2.2 Effect of Significance Function 

We found that all the significance functions we tested worked very well, and 
all of them performed better than our naive Bayes. Figure 3 shows the ROC 
curves produced by each significance function (with no normalisation) for 
what proved to be one of the most difficult data sets (SAeh-11: see 
Section 5.2). 

Figure 3: ROC curves for all significance functions (no normalisation) on 
the SAeh-11 data set. 
Table 6: Sum of errors (FPR+FNR) values at a conventional threshold 
of 1 for all significance functions under match permutation 
normalisation. The best scores for each email data set are highlighted in 
bold. 

Sum of errors (%) at th = 1 for specifications of $\phi(p)$: 

| EDS Code    | constant | linear | square | root | logit | sigmoid |
|-------------|----------|--------|--------|------|-------|---------|
| LS-11       | 1.5      | 2.25   | 2.5    | 1.75 | 1.75  | 1.75    |
| LS-46       | 1.33     | 1.42   | 1.92   | 1.08 | 1.58  | 1.42    |
| LS-15       | 1.33     | 1.33   | 1.55   | 1.33 | 1.55  | 1.89    |
| SAe-11      | 0.25     | 0.5    | 0.5    | 0.5  | 0.25  | 0.75    |
| SAe-46      | 0.5      | 0.75   | 0.5    | 0.5  | 0.25  | 0.75    |
| SAe-15      | 1.00     | 1.50   | 1.8    | 1.5  | 1.1   | 2.00    |
| SAeh-11     | 7.00     | 7.00   | 7.50   | 6.75 | 5.49  | 6.50    |
| SAeh-46     | 4.33     | 4.58   | 4.92   | 5.00 | 4.42  | 4.92    |
| SAeh-15     | 9.3      | 7.5    | 8.00   | 7.7  | 7.6   | 8.6     |
| BKS-LS-11   | 0        | 0      | 0      | 0    | 0     | 0       |
| BKS-LS-46   | 0        | 0      | 0      | 0    | 0     | 0       |
| BKS-LS-15   | 0        | 1.5    | 1.5    | 1.00 | 0     | 1.5     |
| BKS-SAe-11  | 4.75     | 1.75   | 1.5    | 1.5  | 1.5   | 1.75    |
| BKS-SAe-46  | 4.5      | 1.75   | 2.00   | 1.75 | 1.5   | 2.75    |
| BKS-SAe-15  | 9.5      | 6.00   | 6.00   | 5.50 | 5.50  | 8.5     |
| BKS-SAeh-11 | 9.25     | 5.75   | 7.25   | 5.00 | 5.75  | 7.25    |
| BKS-SAeh-46 | 10.25    | 5.25   | 7.00   | 4.25 | 5.00  | 7.25    |
| BKS-SAeh-15 | 15.5     | 9.5    | 9.5    | 9.5  | 9.5   | 14.5    |

Table 7: Sum of error (FPR+FNR) values at individual optimal 
thresholds for all significance functions under match permutation 
normalisation. The best scores for each data set are highlighted in 
bold. 

Sum of errors (%) at optimal th for specifications of $\phi(p)$: 

| EDS Code    | constant | linear | square | root | logit | sigmoid |
|-------------|----------|--------|--------|------|-------|---------|
| LS-11       |          | 1.00   |        | 1.00 | 1.00  | 1.00    |
| LS-46       |          | 1.08   |        | 1.08 | 1.08  | 0.83    |
| LS-15       | 1.33     | 1.33   |        | 1.33 | 1.33  | 1.33    |
| SAe-11      | 0.25     | 0      | 0      | 0    | 0     | 0.25    |
| SAe-46      | 0.42     | 0.33   | 0.33   | 0.25 | 0.25  | 0.5     |
| SAe-15      | 1.00     | 1.3    | 1.4    | 1.1  | 1.1   | 1.2     |
| SAeh-11     | 7.00     | 6.50   | 6.00   | 6.25 | 6.25  | 6.50    |
| SAeh-46     | 4.00     | 4.58   | 4.92   | 4.42 | 4.33  | 4.92    |
| SAeh-15     | 6.50     | 6.60   | 6.70   | 6.50 | 6.60  | 6.30    |
| BKS-LS-11   | 0        | 0      | 0      | 0    | 0     | 0       |
| BKS-LS-46   | 0        | 0      | 0      | 0    | 0     | 0       |
| BKS-LS-15   | 0        | 0      | 0      | 0    | 0     | 0       |
| BKS-SAe-11  | 0        | 0      | 0      | 0    | 0     | 0       |
| BKS-SAe-46  | 0        | 0      | 0      | 0    | 0     | 0       |
| BKS-SAe-15  | 0.2      | 0      | 0      | 0    | 0.2   | 0.50    |
| BKS-SAeh-11 | 2.75     | 1.75   | 2.00   | 1.75 | 2.00  | 2.00    |
| BKS-SAeh-46 | 1.33     | 1.17   | 1.50   | 1.17 | 1.00  | 1.33    |
| BKS-SAeh-15 | 1.1      | 1.2    | 2.00   | 1.30 | 1.1   | 2.1     |

We found little difference between the performance of each of the 
functions across all the data sets we experimented with, as can be seen from 
the summary results in Table 6, which shows the minimum sum of errors 
(FPR+FNR) achieved at a threshold of 1.0 by each significance function on 
each data set. The constant function looks marginally the worst performer 
and the logit and root functions marginally the best, but this difference is 
partly due to differences in optimal threshold (see Section 6.2.3) for each 
function; those that perform less well at a threshold of 1.0 may perform 
better at other thresholds. 

Table 7 presents the minimum sum of errors achieved by each function 
at its individual optimal threshold. In this table there is even less difference 
between the functions, but still the root looks marginally better than the 
others, in that it appears to most frequently achieve the lowest sum of errors, 
and so, for the sake of brevity, we favour this function in much of our 
following analysis. 

6.2.3 Effect of Threshold Variation 

We generally found that there was an optimal threshold (or range of 
thresholds) which maximised the success of the classifier. As can be seen from 
the four example graphs shown in Figure 4, the optimal threshold varies 
depending on the significance function and the mix of ham and spam in the 
training and testing sets, but it tends to always be close to 1. 

Obviously, it may not be possible to know the optimal threshold in 
advance, but we expect, though have not shown, that the optimal threshold can 
be established during a secondary stage of training where only examples with 
scores close to the threshold are used - similar to what [20] call “non-edge 
training”. 

In any case, the main reason for using a threshold is to allow a potential 
user to decide the level of false positive risk they are willing to take; reducing 
the risk carries with it an inevitable rise in false negatives. Thus we may 
consider the lowering of the threshold as attributing a greater cost to 
misclassified ham (false positives) than to misclassified spam; a threshold of 
1.0 attributes equal importance to the two. 

The shapes of the graphs are typical for all values of $\phi(p)$; the 
performance of a particular scoring configuration is reflected not only by the 
minima achieved at optimal thresholds but also by the steepness (or 
shallowness) of the curves; the steeper they are, the more rapidly errors rise 
at sub-optimal levels, making it harder to achieve zero false positives without 
a considerable rise in false negatives. Graph (d) shows that our NB classifier 
is the most unstable in this respect. 
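The search for an optimal threshold can be sketched as a sweep that picks the threshold minimising FPR + FNR. The sketch assumes that each message has a ham-to-spam score ratio and is labelled ham when that ratio exceeds the threshold; this reading is an assumption, chosen because it is consistent with the statement above that lowering the threshold reduces false positives. The names `error_rates` and `best_threshold` and the toy data are hypothetical.

```python
def error_rates(ratios, is_spam, th):
    """Return (FPR, FNR) when a message is labelled ham iff its ratio exceeds th."""
    fp = sum(1 for r, s in zip(ratios, is_spam) if not s and r <= th)  # ham labelled spam
    fn = sum(1 for r, s in zip(ratios, is_spam) if s and r > th)       # spam labelled ham
    n_ham = sum(1 for s in is_spam if not s)
    n_spam = len(is_spam) - n_ham
    return fp / n_ham, fn / n_spam


def best_threshold(ratios, is_spam, thresholds):
    """Threshold with the smallest sum of errors (first one wins on ties)."""
    return min(thresholds, key=lambda th: sum(error_rates(ratios, is_spam, th)))


# Toy scores: three ham messages followed by three spam messages.
ratios = [1.20, 1.05, 0.95, 0.90, 0.80, 1.02]
is_spam = [False, False, False, True, True, True]
thresholds = [round(0.7 + 0.1 * i, 1) for i in range(7)]  # 0.7, 0.8, ..., 1.3

print(error_rates(ratios, is_spam, 1.0))            # one ham and one spam misclassified
print(best_threshold(ratios, is_spam, thresholds))  # 0.9
```

On this toy data, moving the threshold from 1.0 down to 0.9 removes the false positive while leaving the false negative in place, which mirrors the trade-off discussed in this section.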
Figure 4: Effect of threshold variation. Graphs (a-c) show suffix 
tree false positive (FP) and false negative (FN) rates for three 
specifications of $\phi(p)$ under no normalisation; graph (d) shows 
naive Bayes FP and FN rates. 

Figure 5: ROC curves for the constant significance function 
($\phi(p) = 1$) under no match normalisation (MUN), match permutation 
normalisation (MPN) and match length normalisation (MLN), with no 
tree normalisation, on the SAeh-11 data set. MLN has such a 
detrimental effect on performance that its ROC curve is off the 
scale of the graph. 


6.2.4 Effect of Normalisation 

We found that there was a consistent advantage to using match 
permutation normalisation, which was able to improve overall performance as 
well as making the ST classifier more stable under varying thresholds. Figure 5 
shows the ROC curves produced by the constant significance function 
under match permutation normalisation (MPN); match length normalisation 
(MLN) reduced performance so much that the resulting curve does not even 
appear in the range of the graph. The stabilising effect of match 
permutation normalisation is reflected in ROC curves by an increase in the 
number of points along the curve, but may be better seen in Figure 6 as a 
shallowing of the FPR and FNR curves. The negative effect of MLN concurs with 
our heuristics from Section lA^ 
Figure 6: Effect of match permutation normalisation. False positive 
(FP) and false negative (FN) rates, plotted against threshold (0.7 to 
1.3), using a constant significance function ($\phi(p) = 1$) on the 
LS-11 EDS. Graph (a) shows the FP and FN rates under no normalisation 
and graph (b) shows FP and FN rates under match permutation 
normalisation. 


6.2.5 Effect of Spam to Ham Ratios 

We initially found that the mix of spam to ham in the data sets could have 
some effect on performance, with the degree of difference in performance 
depending on the data set and the significance function used; however, with 
further investigation we found that much of the variation was due to 
differences in the optimal threshold. This can be seen by first examining the 
differences in performance for different spam:ham ratios shown in Table 
in which a 1:5 ratio appears to result in lower performance than the more 
balanced ratios of 4:6 and 1:1; then examining the results presented in Table0 
where differences are far less apparent. These observations are reinforced 
by the graphs shown in Figure 7. In graph (a), which shows the ROC curves 
produced by the constant significance function with no normalisation on the 
SAeh data sets, we can see that the curves produced by different ratios appear 
to achieve slightly different maximal performance levels but roughly follow 
the same pattern. Graphs (b-d) further show that the maximal levels of 
performance are achieved at different thresholds for each ratio. 

6.2.6 Overall Performance Across Email Data Sets 

Table 8 summarises the results for both the ST and NB classifiers at a 
threshold of 1.0 and Table 9 summarises results at the individual optimal 
thresholds which minimise the sum of the errors (FPR+FNR). 

We found that the performance of the NB is in some cases dramatically 
Figure 7: Effect of varying ratios of spam:ham on the SAeh 
data using a constant significance function ($\phi(p) = 1$) with no 
normalisation. Graph (a) shows the ROC curves (true positive rate against 
false positive rate) produced for each ratio; while graphs (b-d) show the 
FP and FN rates against threshold separately for ratios of 1:1 (SAeh-11), 
4:6 (SAeh-46) and 1:5 (SAeh-15) respectively. 
Table 8: Classification errors at threshold of 1, for Naive Bayes 
(NB) and a Suffix Tree (ST) using a root significance function 
and match permutation normalisation, but no tree normalisation. 

| EDS Code    | NB FPR (%) | NB FNR (%) | ST FPR (%) | ST FNR (%) |
|-------------|------------|------------|------------|------------|
| LS-11       |            | 0.50       | 1.00       | 1.75       |
| LS-46       |            | 1.25       | 0.83       | 0.25       |
| LS-15       | 1.00       | 1.00       | 0.22       | 0.13       |
| SAe-11      | 0          | 2.75       | 0          | 0.50       |
| SAe-46      | 0.17       | 2.00       | 0          | 0.50       |
| SAe-15      | 0.30       | 3.50       | 0          | 1.50       |
| SAeh-11     | 10.50      | 1.50       | 3.50       | 3.25       |
| SAeh-46     | 5.67       | 2.00       | 2.00       | 3.00       |
| SAeh-15     | 4.10       | 7.00       | 0.70       | 7.00       |
| BKS-LS-11   | 0          | 12.25      | 0          | 0          |
| BKS-LS-46   | 0.17       | 13.75      | 0          | 0          |
| BKS-LS-15   | 0.20       | 30.00      | 0          | 1.00       |
| BKS-SAe-11  | 0          | 9.00       | 0          | 1.50       |
| BKS-SAe-46  | 0          | 8.25       | 0          | 1.75       |
| BKS-SAe-15  | 1.00       | 15.00      | 0          | 5.5        |
| BKS-SAeh-11 | 16.50      | 0.50       | 0          | 5.00       |
| BKS-SAeh-46 | 8.17       | 0.50       | 0          | 4.25       |
| BKS-SAeh-15 | 8.10       | 5.50       | 0          | 9.50       |

Table 9: Classification errors at optimal thresholds (where the sum of the errors is
minimised) for Naive Bayes (NB) and a Suffix Tree (ST) using a root significance
function and match permutation normalisation, but no tree normalisation.

              Naive Bayes                    Suffix Tree
EDS Code      OptTh        FPR (%)  FNR (%)  OptTh        FPR (%)  FNR (%)
LS-11         1.0          1.25     0.50     0.96         0        [illegible]
LS-46         1.02         1.00     0.67     0.96         0.33     [illegible]
LS-15         1.00         1.00     1.00     0.98 - 1.00  0.22     1.11
SAe-11        1.06         0.25     0        1.10         0        [illegible]
SAe-46        1.04         0.33     0.25     1.02         0        [illegible]
SAe-15        1.02         2.30     1.50     1.02         0.1      1.00
SAeh-11       0.98         10.50    1.50     0.98         2.75     3.50
SAeh-46       1.00         5.67     2.00     0.98         1.16     3.25
SAeh-15       1.02         7.60     1.50     1.10         3.50     3.00
BKS-LS-11     1.04         0.75     2.25     0.78 - 1.22  0        0
BKS-LS-46     1.06         2.50     1.25     0.78 - 1.16  0        0
BKS-LS-15     1.10         5.50     1.50     1.02 - 1.22  0        0
BKS-SAe-11    1.04 - 1.06  0        0.25     1.04 - 1.28  0        0
BKS-SAe-46    1.06         0.50     0.25     1.18 - 1.28  0        0
BKS-SAe-15    1.04         6.90     0        1.20         0        0
BKS-SAeh-11   0.98         8.00     2.00     1.06         0        1.75
BKS-SAeh-46   0.98         4.00     3.75     1.14 - 1.16  0.67     0.5
BKS-SAeh-15   1.00         8.10     5.50     1.24 - 1.26  0.80     0.50

Table 10: Computational performance of suffix tree classification on four
bare (no pre-processing) data sets. Experiments were run on a Pentium IV
3GHz Windows XP laptop with 1GB of RAM. Averages are taken over
all ten folds of cross-validation.

EDS Code (size)     Training  AvSpam  AvHam   AvPeakMem
LS-FULL (7.40MB)    63s       843ms   659ms   765MB
LS-11 (1.48MB)      36s       221ms   206ms   259MB
SAeh-11 (5.16MB)    155s      504ms   2528ms  544MB
BKS-LS-11 (1.12MB)  41s       161ms   222ms   345MB


improved at its optimal threshold, for example in the case of the BKS-LS
data sets. But at both a threshold of 1.0 and at optimal thresholds, the NB
classifier behaves very much as expected, supporting our initial assumptions
as to the difficulty of the data sets. This can be clearly seen in Table 9:
on the SAeh data sets, which contain ham with 'spammy' features, the NB
classifier's false positive rate increases, meaning that a greater proportion of
ham has been incorrectly classified as spam; and on the BKS-SAeh data sets,
which additionally contain spam which is disguised to appear as ham, the NB
classifier's false negative rate increases, meaning that a greater proportion of
spam has been misclassified as ham.

The performance of the ST classifier also improves at its optimal
thresholds, though not so dramatically, which is to be expected considering our
understanding of how it responds to changes in the threshold (see Section 6.2.3).
The ST also shows improved performance on data sets involving BKS data.
This may be because the character level analysis of the suffix tree approach
is able to treat the attempted obfuscations as further positive distinguishing
features, which do not exist in the more standard examples of spam which
constitute the LS data sets. In all cases except on the SAeh data, the ST is
able to keep the sum of errors close to or below 1.0, and in some cases, it is
able to achieve a zero sum of errors. Furthermore, the suffix tree's optimal
performance is often achieved at a range of thresholds, supporting our earlier
observation of greater stability in its classification success.

6.2.7 Computational Performance

For illustrative purposes, in this section we provide some indication of the
time and space requirements of the suffix tree (ST) classifier using a suffix
tree of depth d = 8. However, it should be stressed that in our implementation
of the ST classifiers we made no attempts to optimise our algorithms, as
computational performance was not one of our concerns in this paper. The
figures quoted here may therefore be taken as indicators of worst-case
performance levels.

Table 10 summarises the time and space requirements of the suffix tree
classifier on four of our email data sets. The suffix tree approach clearly
and unsurprisingly has high resource demands, far above the demands of a
naive Bayes classifier which on the same machine typically uses no more than
40MB of memory and takes approximately 10 milliseconds (ms) to make a
classification decision.

The difference in performance across the data sets is, however, exactly as 
we would expect considering our assumptions regarding them. The first point 
to note is that the mapping from data set size to tree size is non-linear. For 
example, the LS-FULL EDS is 5 times larger than the LS-11 EDS but results 
in a tree only 2.95 times larger. This illustrates the logarithmic growth of the 
tree as more information is added; the tree only grows to reflect the diversity 
(or complexity) of the training data it encounters and not the actual size of 
the data. Hence, though the BKS-LS-11 EDS is in fact approximately 25% 
smaller than the LS-11 data set, it results in a tree that is over 30% larger. We 
would therefore expect to eventually reach a stable maximal size once most 
of the complexity of the profiled class is encoded.
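
The ratios quoted above can be checked against the figures in Table 10. The
short calculation below is only a worked check; it assumes that the average
peak memory column (AvPeakMem) is the measure of tree size referred to in the
text, and the variable and function names are illustrative.

```python
# Data set sizes and average peak memory (both in MB), as listed in Table 10.
data_size_mb = {"LS-FULL": 7.40, "LS-11": 1.48, "BKS-LS-11": 1.12}
avg_peak_mem_mb = {"LS-FULL": 765, "LS-11": 259, "BKS-LS-11": 345}


def growth(table, larger, smaller):
    """Ratio of one table entry to another."""
    return table[larger] / table[smaller]


# LS-FULL against LS-11: data grows about 5x, memory about 2.95x.
data_ratio = growth(data_size_mb, "LS-FULL", "LS-11")
tree_ratio = growth(avg_peak_mem_mb, "LS-FULL", "LS-11")

# BKS-LS-11 against LS-11: less data (about -24%), more memory (about +33%).
bks_data_change = growth(data_size_mb, "BKS-LS-11", "LS-11") - 1.0
bks_tree_change = growth(avg_peak_mem_mb, "BKS-LS-11", "LS-11") - 1.0

print(f"data x{data_ratio:.2f}, tree x{tree_ratio:.2f}")
print(f"BKS-LS-11 data {bks_data_change:+.1%}, tree {bks_tree_change:+.1%}")
```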

The current space and time requirements are viable, though demanding,
in the context of modern computing power, but a practical implementation
would obviously benefit from optimisation of the algorithms.

Time could certainly be reduced very simply by implementing, for example,
a binary search over the children of each node; the search is currently
done linearly over an alphabet of approximately 170 characters (upper- and
lower-case characters are distinguished, and all numerals and special
characters are considered; the exact size of the alphabet depends on the specific
content of the training set). And there are several other similarly simple
optimisations which could be implemented.
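
A minimal sketch of such a binary search is given below. It is not the
implementation used in our experiments: the Node class, its fields and its
method names are hypothetical, and it assumes that each node keeps its child
characters in a sorted list so that a lookup costs O(log k) comparisons for k
children instead of O(k).

```python
from bisect import bisect_left


class Node:
    """Hypothetical suffix tree node with children kept in sorted order."""

    __slots__ = ("keys", "children", "freq")

    def __init__(self):
        self.keys = []      # child characters, always sorted
        self.children = []  # child nodes, parallel to self.keys
        self.freq = 0       # frequency count for the substring ending here

    def find_child(self, ch):
        """Binary search for the child labelled ch; None if absent."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        return None

    def get_or_add_child(self, ch):
        """Return the child labelled ch, inserting it in sorted position."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        node = Node()
        self.keys.insert(i, ch)
        self.children.insert(i, node)
        return node


root = Node()
for ch in "cab":
    root.get_or_add_child(ch).freq += 1
print(root.keys)  # prints ['a', 'b', 'c']
```

In a language with built-in hash maps, a dictionary keyed by character would
be a common alternative to the sorted list, trading memory for constant-time
lookups on average.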

However, even with a fully optimised algorithm, the usual trade-off
between resources and performance will apply. With regard to this, an
important observation is that resource demands increase exponentially with depth,
whereas performance increases logarithmically. Hence an important factor
in any practical implementation will be the choice of the depth of the suffix
tree profiles of classes.
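
The dependence of tree size on depth can be explored with a toy sketch such
as the following. It is a simplified, hypothetical illustration: it builds a
depth-limited tree of substrings using nested dictionaries, omits the
frequency counts a real profile would need, and the function names are not
taken from our implementation. For an alphabet of size s, the number of nodes
at depth d is bounded above by s + s^2 + ... + s^d, although real text falls
well below this bound.

```python
def build_profile(texts, depth):
    """Insert every substring of length <= depth (all truncated suffixes)."""
    root = {}
    for text in texts:
        for start in range(len(text)):
            node = root
            for ch in text[start:start + depth]:
                node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count all nodes below the given node (the root itself is excluded)."""
    return sum(1 + count_nodes(child) for child in node.values())


sample = ["abab"]
sizes = [count_nodes(build_profile(sample, d)) for d in range(1, 5)]
print(sizes)  # prints [2, 4, 6, 7]
```

Running the same count on a realistic training set for increasing depths
gives a direct view of the memory cost of each additional level, which can
then be weighed against any gain in accuracy.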

^The literature on suffix trees deals extensively with improving (reducing) the resource demands 
of suffix trees l 28 l l^ lT 2 l . 

7 Conclusion


Clearly, the non-parametric suffix tree performs universally well across all
the data sets we experimented with, but there is still room for improvement:
whereas in some cases the approach is able to achieve perfect classification
accuracy, this is not consistently maintained. Performance may be improved
by introducing some pre-processing of the data or post-processing of the
suffix tree profile, and we intend to investigate this in future work. Certainly, the
results presented in this paper demonstrate that the ST classifier is a viable
tool in the domain of email filtering and further suggest that it may be useful
in other domains. However, this paper constitutes an initial exploration of the
approach, and further development and testing are needed.

In the context of the current work, we conclude that the choice of
significance function is the least important factor in the success of the ST approach
because all of them performed acceptably well. Different functions will
perform better on different data sets, but the root function appeared to perform
marginally more consistently well on all the email data sets we experimented
with.

Match permutation normalisation was found to be the most effective 
method of normalisation and was able to improve the performance of all 
significance functions. In particular it was able to improve the success of the 
filter at all threshold values. However, other methods of normalisation were 
not always so effective, with some of them making things drastically worse. 

The threshold was found to be a very important factor in the success of
the filter. So much so, that the differences in the performances of particular
configurations of the filter were often attributable more to differences in
their corresponding optimal thresholds than to the configurations themselves.
However, as a cautionary note, variations in the optimal threshold may be due
to peculiarities of the data sets involved, and this could be investigated
further.

In the case of both the NB and ST filters, it is clear that discovering the
optimal threshold - if it were possible - is a good way of improving
performance. It may be possible to do this during an additional training phase in
which we use some proportion of the training examples to test the filter and
adjust the threshold up or down depending on the outcome of each test. Of
course, the threshold may be continuously changing, but this could be
handled to some extent dynamically during the actual use of the filter by
continually adjusting it in the light of any mistakes made. This would certainly be
another possible line of investigation.
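
One possible form of such an adjustment is sketched below. This is a
speculative illustration of the idea rather than a tested method: the
function name, the fixed step size of 0.02 and the decision rule (a message
is labelled spam when its ham-to-spam score ratio falls below the threshold)
are all assumptions made for the example.

```python
def tune_threshold(ratios, is_spam, threshold=1.0, step=0.02):
    """Nudge the threshold after each mistake on held-out training examples.

    Assumed rule (hypothetical): a message is labelled spam when its score
    ratio is below the threshold, so lowering the threshold flags fewer
    messages and raising it flags more.
    """
    for ratio, spam in zip(ratios, is_spam):
        predicted_spam = ratio < threshold
        if predicted_spam and not spam:
            threshold -= step  # false positive: be less aggressive
        elif spam and not predicted_spam:
            threshold += step  # false negative: be more aggressive
    return threshold


# Two spam messages just above the starting threshold raise it twice.
tuned = tune_threshold([1.05, 1.05, 0.5, 1.5], [True, True, True, False])
print(round(tuned, 2))  # prints 1.04
```

The same update could be applied during live use of the filter whenever a
user corrects a decision, which is one way of tracking a threshold that
drifts over time.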

We also found that the false positive rate (FPR) and false negative rate
(FNR) curves created by varying the threshold were in all cases relatively
shallower for our ST classifier than those for our NB classifier, indicating
that the former always performs relatively better at non-optimal thresholds,
thereby making it easier to minimise one error without a significant cost in
terms of the other error.

Finally, any advantages in terms of accuracy in using the suffix tree to
filter emails must be balanced against higher computational demands. In this
paper, we have given little attention to minimising this factor, but even though
available computational power tends to increase dramatically, cost will
nevertheless be important when considering the development of the method into
a viable email filtering application, and this is clearly a worthwhile line of
further investigation. However, the computational demands of the approach are not
intractable, and a suffix tree classifier may be valuable in situations where
accuracy is the primary concern.